Genetic associations with human longevity are enriched for oncogenic genes

Abstract Human lifespan is shaped by both genetic and environmental exposures and their interaction. To enable precision health, it is essential to understand how genetic variants contribute to earlier death or prolonged survival. In this study, we tested the association of common genetic variants and the burden of rare non-synonymous variants in a survival analysis, using age-at-death (N = 35,551, median [min, max] = 72.4 [40.9, 85.2]), and last-known-age (N = 358,282, median [min, max] = 71.9 [52.6, 88.7]), in European ancestry participants of the UK Biobank. The associations we identified seemed predominantly driven by cancer, likely due to the age range of the cohort. Common variant analysis highlighted three longevity-associated loci: APOE, ZSCAN23 , and MUC5B . We identified six genes whose burden of loss-of-function variants is significantly associated with reduced lifespan: TET2 , ATM , BRCA2, CKMT1B , BRCA1 and ASXL1 . Additionally, in eight genes, the burden of pathogenic missense variants was associated with reduced lifespan: DNMT3A, SF3B1, CHL1 , TET2, PTEN, SOX21, TP53 and SRSF2 . Most of these genes have previously been linked to oncogenic-related pathways and some are linked to and are known to harbor somatic variants that predispose to clonal hematopoiesis. A direction-agnostic (SKAT-O) approach additionally identified significant associations with C1orf52, TERT, IDH2, and RLIM , highlighting a link between telomerase function and longevity as well as identifying additional oncogenic genes. Our results emphasize the importance of understanding genetic factors driving the most prevalent causes of mortality at a population level, highlighting the potential of early genetic testing to identify germline and somatic variants increasing one’s susceptibility to cancer and/or early death.


Introduction
Longevity is a complex trait influenced by both genetic and environmental factors and their interactions [1].According to previous studies, genetics accounts for as much as 40% of the heritability of longevity [2][3][4].Identifying the genetic variants that contribute to earlier death or prolonged survival can highlight key biological pathways linked to lifespan and inform genetic testing for general health and screening and enabling precision health.Previous genome-wide association studies (GWAS) have identified over 20 associated loci including APOE [5,6], CHRNA3/5 [7], HLA-DQA1 and LPA [8].Recently, a burden analysis of proteintruncating variants from whole-exome sequencing (WES) data identified four additional genes (BRCA2, BRCA1, ATM, and TET2) linked to reduced lifespan [9].However, most previous research on lifespan genetics has predominantly used proxy data, such as parents' age at death, due to a lack of proband lifespan data.While proxy-based GWAS have been necessary for large cohorts of primarily middle-aged individuals with limited mortality data, this approach restricts the accuracy and scope of findings, as it may fail to comprehensively capture the genetic influences that directly impact an individual's lifespan [10].On the other hand, some studies have employed logistic regression models on cases of extreme longevity and younger controls [11][12][13].This approach may offer new insights by focusing on exceptionally longlived individuals, yet they can be limited and costly.Moreover, replication of borderline significant variants remains an issue due to varying case definitions across studies, with some defining cases as individuals who survive to ages beyond 90 or 100 years or using the 90th or 99th survival percentiles as the age cutoff.In this study, we carried out a genetic analysis of direct mortality data in the UK Biobank, the genetic database with the largest number of reported deaths (35,551 subjects) and aged individuals (344,237 subjects over 60 years old).To assess the association of genetic variants with longevity in a survival analysis, we performed GWAS of common variants imputed from microarray data as well as burden/sequence kernel association test-optimized (SKAT-O) association of rare non-synonymous variants from WES data.

Methods Study participants
The UKB is a large population-based longitudinal cohort study with recruitment from 2006 to 2010 in the United Kingdom [17].In total, 502,664 participants aged 40-69 years were recruited and underwent extensive phenotyping including health and demographic questionnaires, clinic measurements, and blood draw at one of 22 assessment centers, of whom 468,541 subjects have been genotyped by both SNP array and WES.We restricted our analysis to 393,833 individuals who self-reported their ethnic background as 'white British' and were categorized as European ancestry based on genetic ethnic grouping (Field: 21000).Among them, 35,551 subjects were reported deceased, and their ages at death were recorded from the UK Death Registry (Field: 40007).For the other 358,282 subjects without death records, we assumed they were still alive by the latest censoring date (November 30, 2023).We determined their last known ages by subtracting their year and month of birth (Field: 33) from the censoring date.SNP array genotyping and QC A total of 488,000 UKB participants were genotyped using one of two closely related Affymetrix microarrays (UKB Axiom Array or UK BiLEVE Axiom Array) for ~ 820,000 variants.Quality control (QC), phasing, and imputation were performed as described previously [18].Briefly, the genotyped dataset was phased and imputed into UK10K, 1000 Genomes Project phase 3, and Haplotype Reference Consortium reference panels, resulting in approximately 97 million variants.Additionally, we removed SNPs with imputation quality score < 0.3, genotype missing rate > 0.05, minor allele frequency (MAF) < 1%, and Hardy-Weinberg equilibrium  < 1.0 × 10 !$ .

Genome-wide association studies
We performed linear regression models using PLINK v2.0 [19] to test the association of common variants (MAF ≥ 1%) with longevity for the entire cohort, as well as stratified by sex.For all three analyses, we used Martingale residuals calculated using the Cox proportional hazards model as the outcome variable.The procedure for calculating Martingale residuals was as follows.First, a Cox proportional hazards model [20] was fitted without genotype, where  ) () is the baseline hazard function at time point  given the last-known age and dead/alive status, Z = [Z # , … , Z .] is a covariate matrix, and β = [β # , … , β .] is a coefficient matrix for Z.Here, we included sex and the first five principal components (PC) as covariates, but for sex-specific analyses, sex was excluded.Then, Martingale residuals were calculated as:  / 8 =  0 −  ) 8 () ,-1! where  0 is the dead/alive status (0=alive, 1=dead) of the th subject and β < is the estimated coefficient matrix.We adapted the coxph function from the survival (v.3.6.4)R package [21] to compute the Martingale residuals.Genome-wide significance threshold was set at the standard GWAS level (=5.0× 10 !" ).We used LocusZoom [22] to generate regional plots and Python v.3.7 to create Manhattan plots.

Gene expression and colocalization analysis
To evaluate the effect of the significant loci identified in our GWAS, we examined expression quantitative trait loci (eQTLs) across 49 tissues having at least 73 samples from the Genotype-Tissue Expression Project (GTEx) version 8 [23].Bayesian colocalization analysis was employed using the COLOC package (v.5.2.3) [24] in R and the posterior probability of colocalization (PP4) was calculated between GWAS findings and eQTL associations within a 1 megabase (Mb) window.Additionally, colocalization was visualized using the locuscompareR package [25].Whole-exome sequencing and QC Whole exome sequencing (WES) data was available for 469,835 UKB participants.The dataset was generated by the Regeneron Genetics Center [26].Details about the production and QC for the WES data are described previously [26].We restricted the WES analysis to rare variants (MAF < 1%).

Rare variant annotation
Rare variants in WES data were annotated using Variant Effect Predictor (v.112) provided by Ensembl [27].We defined LoF variants as those with predicted consequences: splice acceptor, splice donor, stop gained, frameshift, start loss, stop loss, transcript ablation, feature elongation, or feature truncation.Missense variants were annotated using AlphaMissense [28] and REVEL [29] plugins and included if they had an AlphaMissense score ≥ 0.7 or REVEL score ≥ 0.75.All annotation was conducted based on GRCh38 genome coordinates.

Gene-based rare variant association studies
For testing groups of rare variants, genotype matrices were first transformed into a binary variable describing whether samples carry a variant of a given class as follows:

,
Where  02 is the minor allele count observed for subject  at variant  in the gene and  is the number of variants in the gene.We carried out two gene-based tests: Burden test and sequence kernel association test-optimized (SKAT-O) [30].The burden test is a mean-based test that assumes the same direction of effects for all variants within a gene.On the other hand, SKAT-O employs a weighted average of the burden test and SKAT [31], the latter a variance-based test that does not lose power when variants have opposing directions of effect.
Association tests were performed for each gene and rare variant class, including separately LoF variants, missense variants with an AlphaMissense score ≥ 0.7, and missense variants with a REVEL score ≥ 0.75, using Martingale residuals as the phenotype as in the common variant analyses.We excluded genes with fewer than 10 variant carriers to ensure the reliability of our analyses.A gene-wide significance threshold was established at =7.4× 10 !% based on the Bonferroni method accounting for the number of genes, variant classes, and statistical methods.Gene-based analyses were carried out using the SKAT package (v.2.2.5) in R.
To characterize the impacts of gene burden in significant genes, we compared lifespan survival depending on gene burden using Kaplan-Meier survival curves, log-rank tests, and Cox proportional hazard regression analyses.Additionally, we performed Cox proportional hazards regression to assess the effect of each rare variant in a gene.The survival (v.3.6.4) package in R was utilized for the survival analysis.

Phenome-wide association studies
For gene-wide significant genes, we conducted phenome-wide association studies (PheWAS) of variant carrier status across 1,670 phenotypes in the UKB derived from binary, categorical, and continuous traits.Phenotypes included the International Classification of Disease 10 (ICD-10) codes, family history (e.g.father's illness, father's age at death), blood count (e.g.white blood cell count), blood biochemistry (e.g.Glucose levels), infectious diseases (e.g.pp 52 antigen for Human Cytomegalovirus), physical measures (e.g.BMI), cognitive test (e.g.pairs matching) and brain measurements (e.g.subcortical volume of hippocampus).For ICD-10 codes, we excluded phenotypes from the following ICD-10 chapters: 'Injuries, poisonings, and certain other consequences of external causes' (Chapter XIX), 'External causes of morbidity and mortality' (Chapter XX), 'Factors influencing health status and contacts with health services' (Chapter XXI), and 'Codes for special purposes' (Chapter XXII).The ICD-10 codes were then converted into Phecodes (v.1.2) [32] which combine correlated ICD codes into a distinct code and improve alignment with diseases commonly used in clinical practice.For binary traits, we removed phenotypes with fewer than 100 cases, and for continuous traits, those with fewer than 100 participants were excluded.Depending on the phenotype, we employed various regression models including binary logistic regression, ordinal logistic regression, multinomial logistic regression, and linear regression.All analyses included age, sex, and first five PCs as covariates.Phenome-wide significance threshold was set at =2.9 × 10 !& based on the number of phenotypes.

Variant allelic fraction
To investigate whether some gene-level associations are enriched for somatic variants, we computed the variant allele frequency (VAF) for each heterozygous sample, reporting the mean VAF and VAF distribution per gene per variant class.VAF is defined as the number of reads with an alternate allele divided by the read depth at a given variant position.We also calculated the confidence interval for the mean VAF per gene using 10,000 bootstrap samples to ensure robust statistical analysis.

Discussion
In this study, we report several known and novel findings related to genetic risks associated with longevity analyzing 393,833 European participants from the UKB.In the common variant GWAS, three independent loci associated with increased mortality risk were identified.In the gene-based analysis of rare non-synonymous variants, 17 genes had their burden/SKAT-O test associated with longevity.
Consistent with previous reports, rs429358, determining the APOE-ε4 allele dosage, was associated with decreased lifespan across both sexes.APOE-ε4 is well known for its associations with Alzheimer's Disease [33] and cardiovascular disease [34].In our dataset, the proportion of ε4 carriers was significantly higher for deaths caused by 'Disease of the circulatory system' and 'Diseases of the nervous system' compared to the general prevalence of ε4 carriers, which could explain the effect of ε4 on longevity.Examining the subcategories of these ICD-10 chapters, 'Disease of the circulatory system' includes cardiovascular disease (I51.6), while 'Diseases of the nervous system' covers Alzheimer's disease (G30).We also identified a GWS association at the ZSCAN23 locus, which had not been previously reported.Our colocalization analysis revealed that the longevity-associated signal colocalizes with a ZSCAN23 eQTL in pancreatic tissue with increased expression observed in minor allele carriers.Although the role of ZSCAN23 remains unclear, recent studies have linked its expression to pancreatic tumors, supporting our colocalization findings [35].For sex-specific GWAS, a GWS association specific to males was found between MUC5AC and MUC5B, which highly colocalizes with a MUC5B eQTL in lung tissue and many studies have linked this variant to pulmonary disease like idiopathic pulmonary fibrosis [36,37] and COVID-19 [38,39].Previously reported SNP associations with longevity were concordant in our dataset but none of these passed the GWAS suggestive threshold (=1.0× 10 !& ) except for those at the APOE locus.This phenomenon likely resulted from previous studies relying on proxy data such as parental age at death, which may capture a different set of genetic factors than direct proband mortality data.In our gene-based rare variant analysis, 17 genes achieved gene-wide significance (  <7.4× 10 !% ) in either the burden or SKAT-O test.Four of these, TET2, ATM, BRCA2, and BRCA1, were reported in a previous rare-variant analysis of longevity in UKB [9].We identified 13 novel genes associated with longevity-CKMT1B, ASXL1, DNMT3A, SF3B1, PTEN, SOX21, TP53, SRSF2, C1orf52, CHL1, IDH2, RLIM, and TERT-when assessing variants causing genetic LoF or missense variants classified as pathogenic by REVEL or AlphaMissense.Of note, LoF and missense variant analyses identified mostly separate genes with only one overlap (TET2).This supports the use of both categories in rare variant analyses and may indicate that missense variants as classified by AlphaMissense capture a wider range of variation missed when only assessing LoF variants, which are generally interpreted as resulting in haploinsufficiency.Importantly, missense variants may lead to increased or decreased protein function.In our analyses, IDH2 was not gene-wide significant with the burden test (=1.9×10 !( , Table 1) but was highly significant with SKAT-O (=5.3× 10 !'+ , Table 1).Since SKAT-O does not lose power when variants have differing directions of effect, this suggests that different mutations in IDH2 can lead to either increased or decreased longevity.These results underline the gain in information achieved when studying rare missense variants as well as LoF using appropriate statistical techniques.Strikingly, most of the genes we identified carrying longevity-associated rare variants have been previously linked to cancer.TET2, ASXL1, DNMT3A, and SF3B1 are all known to harbor causal leukemia variants [40][41][42][43], and somatic variants in SRSF2 have been described in myelodysplastic syndrome [44].ATM, BRCA2, and BRCA1 mutations have been well characterized in breast, ovarian, and other cancers [45][46][47].RLIM appears to be a regulator of estrogen-dependent transcription, an important pathway in breast cancer [48], and has been recently described as a potential tumor suppressor [49].PTEN and TP53 are well studied due to their critical role in genomic stability and are the two most mutated genes in human cancer [50].IDH2 is also frequently mutated in many kinds of cancer [51].The antisense long noncoding RNA SOX21-AS1, but not SOX21, has been linked to oral, cervical, and breast cancer [52][53][54].A recent study found potential for CKMT1B expression as a prognostic .biomarker in glioma [55]; similarly, alterations in CHL1 expression have been associated with development and metastasis of many types of cancer [56].Finally, variation in both the coding and promoter sequences of TERT has been associated with a variety of cancer types [57,58].Our PheWAS results also suggest that most of these genes are associated with cancer, specifically blood-based tumors such as myeloid leukemia.Combined with the common ZSCAN23 locus we identified, associated with pancreatic tumors, this points to cancer being the major genetic factor currently affecting lifespan in UKB.This is consistent with a previous study of healthspan that found cancer to be the first emerging disease in over half of disease cases in UKB [59].These results likely reflect the characteristics of the cohort, comprised of predominantly middle-aged individuals, with age-at-death ranging from 40.9 to 85.2 years and last-known ages between 52.6 and 88.7 years.For sex-specific rare variant analyses, we identified six novel genes (CDKN1A, PTPRK, COA7, TG, NMNAT2 and PITRM1) in males and three genes (PORCN, UGT1A8 and OLIG1) in females.Some of these genes have been found to associate with sex-specific diseases.In one study, advanced prostate cancer patients had a higher frequency of a variant on the 3'UTR of CDKN1A [60] and the gene has received attention as a potential therapeutic target for prostate cancer [61].PORCN is located on the X chromosome and mutations on it can cause Goltz-Gorlin Syndrome [62], but it has also been found to regulate a signaling pathway that controls cancer cell growth [63].UGT1A8 expression is altered in endometrial cancer [64] and amino acid substitutions in it may modulate estradiol metabolism leading to increased risk of breast and endometrial cancer [65].Since UKB collected DNA from peripheral blood mononuclear cell samples, we explored whether the variants were potentially of somatic origin, picked up by WES genotyping due to CHIP.The VAF distribution of variants included in our analysis emphasizes that several associations are likely linked to CHIP, and notably include the well-established CHIP-related genes TET2, ASLX1, DNMT3A, SF3B1, TP53 and SRSF2.While WES heterozygote genotypes for these variants will not include all variants with some degree of CHIP within these genes (as evidenced by many more individuals having non-zero alternate allele count at these locations, data not shown), it does capture CHIP-related somatic variants sufficiently to establish robust associations with longevity.In UKB the mean duration between the primary visit (blood draw date) and death is currently 9.2 years (± 3.8) and suggests that WES screening for CHIP variants may be used as a precision health tool to contribute to earlier cancer detection by assessing individuals with higher susceptibility risks.In addition to known cancer variants, such as breast cancer-related BRCA1/BRCA2, our study highlights novel associations that should be considered in cancer susceptibility screenings.By combining large-scale GWAS with rare variant analysis, this study enhances our understanding of the genetic basis of human longevity.Our results emphasize the importance of understanding the genetic factors driving the most prevalent causes of mortality on a population level, highlighting the potential for early genetic testing to identify germline and somatic variants that place some individuals at risk of early death.Understanding the biological pathways through which these genes influence cancer and aging, as well as the environmental factors interacting with these pathways, will be essential for developing therapeutic targets aimed at extending healthy lifespan.Our study's implications thus extend beyond genetics, as they touch on the broader aspects of health care, public health policy, and preventive strategies against age-related diseases.In conclusion, this study enhances our understanding of the genetic basis of human longevity by combining large-scale GWAS with detailed rare variant analysis.The novel loci identified warrant further exploration to understand their biological roles and interactions with .environmental factors, which will be crucial for unraveling the complex nature of aging and 433 developing strategies to mitigate its adverse effects.434 .       .

Figure
Figure 1.Common variant GWAS of longevity.(A) Manhattan plot.(B) The proportion of cause of death for the top 4 categories, each accounting for more than 5% of total deaths) (C) Association of causes of death with APOE-ε4 genotype.(D) Locuszoom and (E) colocalization plots at the ZSCAN23 locus, colocalized with ZSCAN23 eQTL in pancreatic tissue in GTEx.PP4: posterior probability of colocalization.

1 .
Figure 1.Common variant GWAS of longevity.(A) Manhattan plot.(B) The proportion of cause of death for the top 4 categories, each accounting for more than 5% of total deaths) (C) Association of causes of death with APOE-ε4 genotype.(D) Locuszoom and (E) colocalization plots at the ZSCAN23 locus, colocalized with ZSCAN23 eQTL in pancreatic tissue in GTEx.PP4: posterior probability of colocalization.
−"#$ !" (p) ZSCAN23 eQTL −"#$ !" c ir c u la t o r y s y s t e m T h e r e s p ir a t o r y s y s t e m T h e n e r v o u s s y s t e m N e o p la s m s T h e c ir c u la t o r y s y s t e m T h e r e s p ir a t o r y s y s t e m T h e n e r v o u s s y s t e

Figure 2 .
Figure 2. Rare variant burden association with longevity, considering loss-of-functions (A) and Alpha Missense pathogenic variants (B).Novel genes are highlighted in red.

Supplementary Figure 5 .Supplementary Figure 6 .
Sex-stratified rare variant burden association with longevity considering 3 categories for each sex: Loss-of-function (A), Alpha Missense (B), and REVEL (C).Novel genes are highlighted in red..Sex-stratified rare variants SKAT-O association with longevity considering 3 categories for each sex: Loss-of-function (A), Alpha Missense (B), and REVEL (C).Novel genes are highlighted in red.

Table 1 . Significant genes for rare variants association with burden and SKAT-O tests (𝒑<7.4×
! ).Gene names in bold font represent novel associations. .

Table 2 . Lead variant association per gene among significant genes in the burden and SKAT-O tests.
Only variants with at least 3 minor alleles are reported.

Table 1 . Demographics of European ancestry in the analyses.
.Supplementary

Table 2 . Significant genes for burden and SKAT-O association of rare variants, considering missense variants with REVEL > 75.
Gene names in bold font represent novel associations in REVEL. .Supplementary

Table 3 . Significant genes for burden and SKAT-O association of rare variants in males.
Genes in bold font represent novel associations in males. .Supplementary

Table 4 . Significant genes for burden and SKAT-O association of rare variants in females
. Genes in bold font represent novel associations in females. .Supplementary

Table 5 . Lead variant association per gene among significant genes in the burden and SKAT-O tests.
Only significant variant associations with at least 3 minor allele counts per gene are reported in this table.

Table 6 . Mean variant allelic fraction per gene across participants included in the corresponding gene-level Burden/SKAT-O analysis.
.